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Abstract — Dictionaries adapted to the data provide superior 
performance when compared to predefined dictionaries in ap- 
plications involving sparse representations. Algorithmic stability 
and generalization are desirable characteristics for dictionary 
learning algorithms that aim to build global dictionaries which 
can efficiently model any test data similar to the training samples. 
In this paper, we propose an algorithm to learn dictionaries 
for sparse representation of image patches, and prove that 
the proposed learning algorithm is stable and generalizable 
asymptotically. The algorithm employs a 1-D subspace clustering 
procedure, the K-Hnes clustering, in order to learn a hierarchical 
dictionary with multiple levels. Furthermore, we propose a 
regularized pursuit scheme for computing sparse representations 
using a multilevel dictionary. Using simulations with natural 
image patches, we demonstrate the stability and generalization 
characteristics of the proposed algorithm. Experiments also 
show that improvements in denoising performance are obtained 
with multilevel dictionaries when compared to global K-SVD 
dictionaries. Furthermore, we propose a robust variant of multi- 
level learning for severe degradations that occur in applications 
like compressive sensing. Results with random projection-based 
compressive recovery show that the multilevel dictionary and its 
robust variant provide improved performances compared to a 
baseline K-SVD dictionary. 

I. Introduction 
A. Dictionary Learning for Sparse Representations 

THE statistical structure of naturally occurring signals 
and images allows for their efficient representation as a 
sparse linear combination of patterns, such as edges, lines and 
other elementary features IJJ . A finite collection of normalized 
features is referred to as a dictionary. The linear model used 
for general sparse coding is given by 

y = ^a + n, (1) 

where y G is the data vector and ^ = ['0i'02 • • • '^k] ^ 
-^MxK |g ^j^^ dictionary. Each column of the dictionary, 
referred to as an atom, is a representative pattern normalized 
to unit ^2 norm, a G is the sparse coefficient vector and n 
is a noise vector whose elements are independent realizations 
from the Gaussian distribution A/'(0, a^). 
The sparse coding problem can be stated as 

a = argmin ||a||o s.t. ||y - ^a||^ < e, (2) 

a 

where ||.||o indicates the norm, ||.||2 denotes the ^2 norm 
and 6 is the error goal for the representation. However, exact 
^0 minimization is a combinatorial problem and hence its 
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convex surrogate, the li norm, is often used. Some of the 
widely used methods for computing sparse representations 
include the Matching Pursuit (MP) j2|. Orthogonal Matching 
Pursuit (OMP) 1 3 1, Basis Pursuit (BP) ||4j, FOCUSS [5] and 
iterated shrinkage algorithms |[6|, ||7j. The sparse coding model 
has been successfully used for inverse problems in images 
|[8l-pQ|, and also in machine learning applications such as 
classification and clustering |TT|-||2T| 

Predefined dictionaries obtained using the discrete cosine 
transform (DCT), wavelet, and curvelet |22| bases have been 
used successfully for image reconstruction and compression. 
The dictionary ^ can also be designed from a union of 
orthonormal bases |23 | or structured as an overcomplete set 
of individual vectors optimized to the training data [ [24| , 
p5| . A wide range of dictionary learning algorithms have 
been proposed in the literature [,26J-[32 |, some of which are 
tailored for specific applications. The conditions under which 
a dictionary can be identified from the training data using 
an minimization approach are derived in p3| . The joint 
optimization problem for dictionary learning and sparse coding 
with ^0 sparsity constraints can be expressed as jSj, p4| , p5| 

min||Y-*A||2, s.t. ||ai||o < ^, Vi, HV'.-lb = 1, Vj, (3) 

where Y = [yiy2 • • • Yt] is a collection of T training vectors, 
A = [aia2 . . . a^] is the coefficient matrix, S is the sparsity 
of the coefficient vector and ||.||i? denotes the Frobenius norm. 
Learned dictionaries have been successfully applied to image 
compression, denoising and inpainting ^9J, [lOJ . 

In this paper, we propose a stable and generalizable learning 
algorithm for designing multilevel dictionaries that are par- 
ticularly suited for sparse approximation of natural images. 
A simple example of learning a dictionary with two levels 
is demonstrated in Figure [T] The properties and performance 
of this learning algorithm will be analyzed in detail in this 
paper. The multilevel dictionary (MLD) learning algorithm is 
a hierarchical procedure where the dictionary atoms in each 
level are obtained using a 1-D subspace clustering algorithm, 
which we refer to as K-lines clustering | 36| The proposed 
algorithm builds global dictionaries using a set of randomly 
chosen training patches obtained from a large collection of 
natural images that can generalize well to any test set of 
patches. For a learned dictionary to provide a good approxi- 
mation, the test data must be similar to the data samples used 

^Note that in the papers the procedure has been referred to as 

K-hyperline clustering. But in this paper, we prefer to use the term K-lines 
clustering, since a 1-D subspace in any number of dimensions is referred only 
to as a line, and not as a hyperline. 
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Fig. 1. Features learned at two levels from non-overlapping patches (8 x 8) 
of a 128 X 96 image. In each level, the patches that are highlighted in the 
image share similar information and hence jointly correspond to a learned 
pattern (also highlighted). 



for training. Since local regions of natural images have high 
redundancy and consistent statistical properties p9|, learning 
global dictionaries from a random collection of natural image 
patches will provide a good representation for patches from 
images not in the training set. The effectiveness of such 
dictionaries have been demonstrated in denoising ||9| and 
compressed recovery [401 . 

B. Stability and Generalization in Learning 

A learning algorithm is a map from the space of training 
examples to the hypothesis space of functional solutions. 
Algorithmic stability characterizes the behavior of a learning 
algorithm with respect to the perturbations of its training set 
pTI , and generalization ensures that the expected error of the 
learned function with respect to the novel test data will be 
close to the average empirical training error |42|. In clustering, 
the learned function is completely characterized by the cluster 
centers. Stability of a clustering algorithm implies that the 
cluster centroids learned by the algorithm are not significantly 
different when different sets of i.i.d. samples from the same 
probability space are used for training f43 |. When there is a 
unique minimizer to the clustering objective with respect to the 
underlying data distribution, stability of a clustering algorithm 
is guaranteed |44| and this analysis has been extended to 
characterize the stability of K-means clustering in terms of the 
number of minimizers | [45| . In | [38| , the stability properties of 
the K-lines clustering algorithm have been analyzed and they 
have been shown to be similar to those of K-means clustering. 
Note that all the stability characterizations depend only on 
the underlying data distribution and the number of clusters, 
and not on the actual training data itself. Generalization 
implies that the average empirical training error becomes 
asymptotically close to the expected error with respect to the 
probability space of data, as the number of training samples 
T ^ oo. In | [46| , the generalization bound for sparse coding in 
terms of the number of samples T, also referred to as sample 
complexity, is derived and in | [47| the bound is improved by 
assuming a class of dictionaries that are nearly orthogonal. 



The algorithmic stability of dictionary learning methods has 
not been discussed in the literature until now, to the best of 
our knowledge. Given a sufficiently large training set, a stable 
learning algorithm will result in global dictionaries that will 
depend only on the probability space to which the training 
samples belong and not on the actual samples themselves. 
Generalization ensures that such global dictionaries learned 
result in a good performance with test data. In other words, 
the asymptotic stability and generalization of a dictionary 
learning algorithm provide a theoretical justification for the 
uniformly good performance of global dictionaries learned 
from an arbitrary training set. 

C. Contributions 

In this paper, we propose the MLD learning algorithm to 
design global representative dictionaries for image patches. 
We show that, for a sufficient number of levels, the proposed 
algorithm converges, and also demonstrate that a multilevel 
dictionary with a sufficient number of atoms per level exhibits 
energy hierarchy (Section III-C| ). Furthermore, we develop a 
Regularized Multilevel OMP (RM-OMP) procedure for com- 
puting sparse codes for test data using the proposed dictionary 
(Section |III-D| ). Some preliminary algorithmic details and 
results obtained using MLD have been reported in pTj. 



Using the fact that the K-lines clustering algorithm is 
stable, we perform stability analysis of the MLD algorithm. 
For any two sets of i.i.d. training samples from the same 
probability space, as the number of training samples T ^ oo, 
we show that the dictionaries learned become close to each 
other asymptotically. When there is a unique minimizer to 
the objective in each level of learning, this holds true even 
if the training sets are completely disjoint. However, when 
there are multiple minimizers for the objective in at least one 
level, we prove that the learned dictionaries are asymptotically 
close when the difference between their corresponding training 
sets is o{-\/T). Instability of the algorithm when the difference 
between two training sets is 1^(>/T), is also shown for the 
case of multiple minimizers (Section [rV|-C). Furthermore, we 
prove the asymptotic generalization of the learning algorithm 
(Section [TV^ . 

The stability characteristics of MLD learning are experimen- 



tally demonstrated using natural image data (Section |IV-E| ). 
We show that, the stability in terms of the learned dictionar- 
ies improves as the difference between their corresponding 
training sets becomes small and as the number of training 
samples increases. We train a global multilevel dictionary from 
a set of patches chosen randomly from a corpus of natural 
images and study its generalization behavior using several 
simulations. For comparison, we use a dictionary learned 
using the K-SVD algorithm, with similar training parameters, 
for the same training data set. We observe that the error in 
sparse approximation for the training and test data sets become 
comparable as the size of the training set increases. When 
compared to the K-SVD, the proposed algorithm exhibits 
much improved generalization by providing reduced test error 
even with a small number of training samples. The learned 
MLD results in a better denoising performance compared to 
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the global K-SVD dictionary (Section |V]). In order to improve 
recovery performance with severe degradations such as com- 
pressive sensing, we also propose a robust MLD (RMLD) 
procedure that uses multiple random subsets of data to obtain 
approximations in each level. Using compressive recovery of 
randomly projected data, we show that the RMLD improves 
over MLD, which in turn performs better than a baseline K- 
SVD dictionary (Section Ml. 



II. Background 

In this section, we describe the K-lines clustering, a 1-D 
subspace clustering procedure proposed in f36l, which forms 
a building block of the proposed dictionary learning algorithm. 
Furthermore, we briefly discuss the results for stability analysis 
of K-means and K-lines algorithms reported in (4T\ and llF] 
respectively. The ideas described in this section will be used in 
Section [IV| to study the stability characteristics of the proposed 
dictionary learning procedure. 

A. K-lines Clustering Algorithm 

The K-lines clustering algorithm is an iterative procedure 
that performs a least squares fit of K 1-D linear subspaces 
to the training data [36J . Note that the K-lines clustering is a 
special case of general subspace clustering methods proposed 
in |[48|-|[5Q|, when the subspaces are 1— dimensional and 
constrained to pass through the origin. In contrast with K- 
means, K-lines clustering allows each data sample to have 
an arbitrary coefficient value corresponding to the centroid of 
the cluster it belongs to. Furthermore, the cluster centroids 
are normalized to unit £2 norm. Given the set of T data 
samples Y = {yi}f^i and the number of clusters K, K- 
lines clustering proceeds in two stages after initialization: 
the cluster assignment and the cluster centroid update. In 
the cluster assignment stage, training vector is assigned 
to a cluster j based on the minimum distortion criteria, 
'H(yi) = argmin^ (i(y^,i/?^), where the distortion measure is 



^(y,^) = l|y-^(y^^)ll2. 



(4) 



In the cluster centroid update stage, we perform singular 
value decomposition (SVD) of = [yi]ieCj^ where Cj = 
{^l^(yi) = j} contains indices of training vectors assigned 
to the cluster j. The left singular vector corresponding to the 
largest singular value of the decomposition, is the centroid of 
cluster j. Different strategies exist for initialization of cluster 
centroids and estimation of the number of hyperlines p6|. 



B. Stability Analysis of Clustering Algorithms 

Analyzing the stability of unsupervised clustering algo- 
rithms can be valuable in terms of understanding their behavior 
with respect to perturbations in the training set. These algo- 
rithms extract the underlying structure in the training data and 
the quality of clustering is determined by an accompanying 
cost function. As a result, any clustering algorithm can be 
posed as a Empirical Risk Minimization (ERM) procedure, 
by defining a hypothesis class of loss functions to evaluate 
the possible cluster configurations and to measure their quality 



|51 1. For example, K-lines clustering can be posed as an ERM 
problem over the distortion function class 

Qk = \g^{y) = d{y,tl)j)J = argmax ly^i/?^ > . (5) 

The class Qk is obtained by taking functions correspond- 
ing to all possible combinations of K unit length vectors from 
the space for the set ^. Let us define the probability 
space for the data in as (3^, X), P), where y is the sample 
space and X) is a sigma- algebra on y, i.e., the collection of 
subsets of y over which the probability measure P is defined. 
The training samples, {yi}^i, are i.i.d. realizations from this 
space. 

Ideally, we are interested in computing the cluster centroids 
^ that minimize the expected distortion E[^^] with respect 
to the probability measure P. However, the underlying distri- 
bution of the data samples is not known and hence we resort 
to minimizing the average empirical distortion with respect to 
the training samples {yi}J=i as 



1 ^ 

argmin - ^^^(yi) 



geGK 



(6) 



When the empirical averages of the distortion functions in Gk 
uniformly converge to the expected values over all probability 
measures P, 



lim sup I 

T^oo p 



sup 



1 ^ 



> 5 



0, 



(7) 

for any ^ > 0, we refer to the class Qk as uniform Glivenko- 
Cantelli (uGC). In addition to being uGC, if the class also 
satisfies a version of the central limit theorem, it is defined 
as uniform Donsker |41|| . In order to determine if Qk is 
uniform Donsker, we have to verify if the covering number of 
Qk with respect to the supremum norm, A^oo(7,Sk), grows 
polynomially in the dimensions M |43| . Here, 7 denotes 
the maximum L^o distance between an arbitrary distortion 
function in Qk, and the function that covers it. For K-lines 
clustering, the covering number is upper bounded by jSSj 
Lemma 2.1] 

NMk)< '^^) , (8) 



7 / 

where we assume that the data lies in an M-dimensional £2 
ball of radius R centered at the origin. Therefore, Qk belongs 
to the uniform Donsker class. 

The general idea behind stability of a clustering algorithm is 
that the algorithm should produce cluster centroids that are not 
significantly different when different i.i.d. training sets from 
the same probability space are used for training ||43|-||45l. 
Stability is characterized based on the number of minimizers 
to the clustering objective with respect to the underlying data 
distribution. A minimizer corresponds to a function g\$f G Qk 
with the minimum expectation IE[^^]. Stability analysis of the 
K-means algorithm has been reported in ||43|, [45 j . 

Though the geometry of K-lines clustering is different 
from that of K-means, the stability characteristics of the two 
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clustering algorithms have been found to be similar | [38| . 
Given two sets of cluster centroids ^ = 
and A = {Ai,...,Ax} learned from training sets of T 
i.i.d. samples each realized from the same probability space, 
let us define the Li{P) distance between the corresponding 
clusterings as 



1^^ -gAUriP) 



9A{y)\dP{y). (9) 



When T ^ oo, and Qk is uniform Donsker, stability in terms 
of the distortion functions is expressed as 



Wg^ -^aIIli(p) 0, 



(10) 



where — > denotes convergence in probability. This holds true 
even for ^ and A learned from completely disjoint training 
sets, when there is a unique minimizer to the clustering 
objective. When there are multiple minimizers, ([TO]) holds 
true with respect to a change in o(>/T) samples between 
two training sets and fails to hold with respect to a change 
in rt{y/T) samples |38|. The distance between the cluster 
centroids themselves is defined as [43J 



A(*,A) 



max mm 

^<3<Kl<l<K 



(11) 



Lemma 2.1 ( [SF]): If the Li{P) distance between the dis- 
tortion functions for the clusterings ^ and A is bounded as 

Wg^ - ^aIIli(p) < for some /i > 0, and dP{y)/dy > C, 
for some C > 0, then A(^, A) < 2sin(p) where 



p <2 sin 



Cc, 



M 



(12) 



Here the training data is assumed to lie outside an M- 
dimensional £2 ball of radius r centered at the origin, and 
the constant Cc,m depends only on C and M. 



When the clustering algorithm is stable according to (10), 



for admissible values of r. Lemma [271] indicates that the cluster 

— p 
centroids become arbitrarily close to each other, A(^, A) — > 

0, which implies stability in terms of cluster centroids. From 



( 12), it is also clear that the K-lines clustering cannot be stable 
if some training vectors have a norm close enough to 0, (i.e.) 
r ^ 0. 



Secondly, the learning procedure must be provably stable, with 
respect to the notion of algorithmic stability, and generalizable. 

The generative model in ([T]) is well suited for natural signals 
and images as they can be represented using a sparse linear 
combination of elementary features chosen from a dictionary 
|24|. The redundancy in the local regions of natural images 
|39| allows for the design of global dictionaries that can 
generalize well to a wide range of images. Global dictio- 
naries learned from a set of randomly chosen patches from 
natural images have been successfully used for denoising ||9|, 
compressed sensing |40| and classification |52|. In addition 
to exhibiting redundancy, the natural image patches typically 
contain either geometric patterns or stochastic textures or 
a combination of both. This fact is demonstrated in (531 , 
where the authors define two types of atomic subspaces to 
model image patches: subspaces of low dimensions (explicit 
manifolds) for primitive geometric patterns and subspaces of 
high dimensions (implicit manifolds) for stochastic textures. 
Since the image patches can contain both geometric and 
stochastic structures, a hybrid combination of explicit and 
implicit manifolds can be used for modeling them |53|. 
The proposed MLD algorithm learns global representative 
patterns in multiple levels, according to the order of their 
energy contribution. Since the geometric patterns usually are 
of higher energy when compared to stochastic textures in 
images, geometric patterns are learned in the first few levels 
and stochastic textures are learned in the last few levels. 

Considering the dictionary learning formulation in ([3]), it 
can be seen that clustering algorithms such as the K-means 
and the K-lines can be obtained by constraining the desired 
sparsity to be 1. Since the stability characteristics of clustering 
algorithms are well understood, employing similar tools to 
analyze the more general dictionary learning can be beneficial. 
Note that the proposed algorithm poses dictionary learning 
as performing K-lines clustering in multiple levels and hence 
in this case we can use the stability characteristics of the 
clustering algorithm to study the stability of multilevel learn- 
ing. Furthermore, by exploiting the fact that the distortion 
function class for each level of learning is uniform Donsker, 
the generalizability of the algorithm can also be proved. Note 
that multilevel learning is different from the work in |54| , 
where multiple sub-dictionaries are designed and one of them 
is chosen for representing a group of patches. 



III. Multilevel Dictionary Learning 

In this section, we motivate and develop a multilevel dic- 
tionary learning approach for sparse representations, whose 
algorithmic stability and generalizability will be proved in 
Section |IV| Furthermore, we propose the RM-OMP algorithm, 
that can be used to obtain sparse codes for a test image using 
the multilevel dictionary. 

A. Motivation for Multilevel Learning 

Our motivation for learning an MLD is two-fold. Firstly we 
require a global dictionary that can exploit, (a) the redundancy 
observed across local regions in natural images and, (b) 
the hierarchy of patterns found in training image patches. 



B. Proposed MLD Learning Algorithm 

We denote the MLD as ^ = [^1*2. --^l], and the 
coefficient matrix as A = [Af A2^...A|^]^. Here, is the 
sub-dictionary and A^ is the coefficient matrix for level I. The 
approximation in level / is expressed as 

Rz_i = ^^A^ + Ki, for / = 1, L, (13) 

where R^_i, Hi are the residuals for the levels / — 1 and 
/ respectively and Rq = Y, the matrix of training image 
patches. This implies that the residual matrix in level / — 1 
serves as the training data for level I. Note that the sparsity 
of the representation in each level is fixed at 1. Hence, the 
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TABLE I 

Algorithm for building a multilevel dictionary. 



LEVEL 1 LEVEL 2 LEVELS LEVEL 4 



,L}. 



Input 

Y = [yi]^i, M X T matrix of training vectors. 
L, maximum number of levels of the dictionary. 
Ki, number of dictionary elements in level /, I = {1, 2, 
e, error goal of the representation. 

Output 

adapted sub-dictionary for level 



Algorithm 

Initialize: I = 1 and Rq = Y. 

Ao = {i I ||ro,2||2 > e, 1 < i < T}, index of training vectors with 
squared norm greater than error goal. 

R-o = [ro, ^ 



while A;_i / and / < L 
Initialize: 

A;, coefficient matrix, size Ki x M, all zeros. 

R^, residual matrix for level /, size M x T, all zeros. 

= r^. where i = Ai_i{j), Vj = 1, |A;_i|. 
= where i = Ai_i(j), Vj = 1, \Ai_i\. 
^i = {i\ >e,l <z<T}. 

end 



overall approximation for all levels is 

L 

Y = ^^,A,+Rl. (14) 

MLD learning can be interpreted as a block-based dictio- 
nary learning problem with unit sparsity per block, where 
the sub-dictionary in each block can allow only a 1 -sparse 
representation and each block corresponds to a level. The 
sub-dictionary for level is the set of cluster centroids 

learned from the training matrix for that level, Rz-i, using 
K-lines clustering. MLD learning can be formally stated as an 
optimization problem that proceeds from the first level until 
the stopping criteria is reached. For level the optimization 
problem is 

argmin||R|_i - subject to ||a/,i||o < 1, 



fori = {l,...,T}, 



(15) 



along with the constraint that the columns of have 
unit ^2 norm, where ^ is the column of and T 
is the number of columns in We adopt the notation 
{^^,A/} = KLC(R^_i,i^^) to denote the problem in ( p3 ) 
where Ki is the number of atoms in the sub-dictionary ^i. 
The stopping criteria is provided either by imposing a limit 
on the residual representation error or the maximum number 
of levels (L). Note that the total number of levels is the same 
as the maximum number of non-zero coefficients (sparsity) 
of the representation. The error constraint can be stated as, 
||r/,i||2 < e, Vz = 1, T for some level /, where e is the error 
goal. 

Table [l| lists the MLD learning algorithm with sparsity and 
error constraints. We use the notation A^(j) to denote the 
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Fig. 2. Multilevel dictionary, with 16 levels of 16 atoms each, comprises of 
geometric patterns in the first few levels, stochastic textures in the last few 
levels and a combination of both in the middle levels. 



element of the set and ^ denotes the column vector in 
the matrix R^. The set A/ contains the indices of the residual 
vectors of level / whose norm is greater than the error goal. 
The residual vectors indexed by A^ are stacked in the matrix, 
R^, which in turn serves as the training matrix for the next 
level, / + 1. In MLD learning, for a given level the residual 
Ti^i is orthogonal to the representation ^i^i i. This implies 
that 



-l,i|l2 



(16) 



Combining this with the fact that = 
3.1 is 1— sparse, and the columns of 
we obtain the relation 



are of unit £2 norm. 



lly.lli 



L 

E 

^=1 



(17) 



Equation ( 17 ) states that the energy of any training vector is 
equal to the sum of squares of its coefficients and the energy 
of its residual. From ([16]), we also have that. 



|R- 



■^-1 



I*/ A, 



(18) 



In our implementation of MLD learning, we include an addi- 
tional step where the residual at each level is orthogonalized 
to the dictionary atoms chosen so far, and the coefficients are 
recomputed. Note that this does not affect any other behavior 
of the algorithm that is discussed in this section. 

The training vectors for the first level of the algorithm, 
ro,i lie in the ambient space and the residuals, ri^^, lie 
in a finite union of R^~^ subspaces. This is because, for 
each dictionary atom in the first level, its residual lies in 
an M — 1 dimensional space orthogonal to it. In the second 
level, the dictionary atoms can possibly lie anywhere in R^, 
and hence the residuals can lie in a finite union of R^~^ 
and R^~^ dimensional subspaces. Hence, we can generalize 
that the dictionary atoms for all levels lie in R^, whereas 
the training vectors of level / > 2, lie in finite unions of 
R^-\ . . . , R^-^+i dimensional subspaces of the R^ space. 
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C. Convergence 

The convergence of MLD learning and the energy hierarchy 
in the representation obtained using an MLD can be shown 
by providing two guarantees. The first guarantee is that for a 
fixed number of atoms per level, the algorithm will converge 
to the required error within a sufficient number of levels. This 
is because the K-lines clustering makes the residual energy 
of the representation smaller than the energy of the training 
matrix at each level (i.e.) ||R^||i < ||R^-i||i. This follows 



from (|18|) and the fact that ||*^A/|||, > 0. 



The second guarantee is that for a sufficient number of 
atoms per level, the representation energy in level / — 1 will 
be less than the representation energy in level To show this, 
we first state that for a sufficient number of dictionary atoms 
per level, ||^/A/|||. > ||Rz|||^. This means that for every / 



IR/ 



< W^iAiWi. < IIR. 



1^, 



(19) 



(18). This implies that ||*^A^|||, < 



because of 

||^^_iA/_i||^, i.e., the energy of the representation in 
each level reduces progressively from / = 1 to / = L. 



D. Sparse Approximation using an MLD 

In order to compute sparse codes for novel test data using a 
multilevel dictionary, we propose to perform reconstruction 
using a Multilevel Orthogonal Matching Pursuit (M-OMP) 
procedure which evaluates a 1- sparse representation for each 
level using the dictionary atoms from that level, and orthog- 
onalizes the residual to the dictionary atoms chosen so far. 
Though asymptotic generalization of the M-OMP method will 
be shown in Section IV-D imposing the energy hierarchy 
observed in the training process to any test data might result 
in poor generalization. Hence, there is a need to regularize this 
procedure such that there is more flexibility in choosing dictio- 
nary atoms for representing the test data. Hence, we propose 
to build a sub-dictionary with atoms selected from the current 
level as well as the u immediately preceding and following 
levels, = [$^_n*i-(n-i) ...*z...*i+(n-i)*z+n], m 
every step of the pursuit algorithm. In our implementation, 
we fix = 2 and also reduce the size of the sub-dictionary 
appropriately when / < u and / > L — u. The dictionary is 
used to compute a 1 -sparse representation for that step of the 
pursuit. It was observed from simulations that the RM-OMP 
scheme performs better than M-OMP, particularly when the 
training set is small. 

E. Demonstration of MLD Learning 

All simulation results presented in this paper were obtained 
with dictionaries learned using randomly chosen patches of 
size 8x8, extracted from the grayscale images in the training 
set of the Berkeley segmentation dataset (BSDS) | [55| . The 
number of grayscale patches used for training will be clearly 
stated for each simulation. As a preprocessing step, the mean 
value of each training patch was removed. In this section, we 
will demonstrate the characteristics of MLD learning using 
an example dictionary learned using 50,000 patches. Note 
that, the number of atoms was fixed at 16 per level and the 




Fig. 3. Levelwise representation energy for the learned MLD with the BSDS 
training data set 



number of levels was fixed at 16, which leads to a total of 
256 atoms. For comparison, a global K-SVD dictionary of 
size 64 X 256 atoms was learned, with the same training set, 
using the MATLAB toolbox available online |56|. In this case, 
the desired sparsity, which refers to the number of non-zero 
coefficients (S), was fixed at 16. Initial dictionary atoms for 
the K-SVD algorithm and for each level of MLD learning 
were obtained using the K-means clustering procedure. 

Figure [2] illustrates the multilevel dictionary designed using 
the algorithm in Table [l| Note that no noise was added to 
the image patches during learning. As it can be observed, 
the learned MLD contains geometric patterns in the first 
few levels, stochastic textures in the last few levels and a 
combination of both in the middle levels. The representation 
energy, ||^^A^|||., captured across all the levels in MLD is 
shown in Figure [3] where the energy hierarchy in learning 
can be clearly seen. 

Given a multilevel dictionary, an iS— sparse representation 
for a test sample can be evaluated using the M-OMP or 
the RM-OMP procedures described in Section |III-D[ For the 
learned K-SVD and multilevel dictionaries, we computed the 
sparse codes for patches from a test dataset, by varying the 
desired sparsity. The test dataset consisted of 120,000 non- 
overlapping 8x8 patches extracted from images in the BSDS 
test images. The illustration in Figure |4] shows the mean 
squared error (MSE) of the representation as a function of 
the number of non-zero coefficients. For the case of MLD, 
the results obtained using both the M-OMP and the RM-OMP 
schemes are shown. The OMP algorithm was employed to 
compute the sparse coefficients with the K-SVD dictionary. 
It can be observed that the MSE obtained using the M-OMP 
procedure is higher in all cases of sparsity, when compared 
to RM-OMP. Since the RM-OMP procedure considers dic- 
tionary atoms from the neighboring levels when computing 
a coefficient, it results in an improved generalization. When 
compared to K-SVD, multilevel dictionaries lead to a more 
accurate reconstruction when the sparsity level 5 > 4, which 
is the range typically used in several applications. 
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Sparsity (# of non-zero coefficients) 

Fig. 4. Comparison of the MSB obtained with the BSDS test dataset using 
the K-SVD and the MLD dictionaries at different levels of sparsity. 



IV. Stability and Generalization 

In this section, the behavior of the proposed dictionary 
learning algorithm is considered from the viewpoint of algo- 
rithmic stability: the behavior of the algorithm with respect 
to the perturbations in the training set. It will be shown that 
the dictionary atoms learned by the algorithm from two dif- 
ferent training sets whose samples are realized from the same 
probability space, become arbitrarily close to each other, as the 
number of training samples T ^ oo. Since the proposed MLD 
learning is equivalent to learning K-lines cluster centroids in 
multiple levels, the stability analysis of K-lines clustering |38J, 
briefly discussed in Section II-B will be utilized in order to 
prove its stability. For each level of learning, the cases of 
single and multiple minimizers to the clustering objective will 
be considered. Proving that the learning algorithm is stable 
will show that the global dictionaries learned from the data 
depend only on the probability space to which the training 
samples belong and not on the actual samples themselves, as 
T ^ oo. We also show that the MLD learning generalizes 
asymptotically, i.e., the difference between expected error 
and average empirical error in training approaches zero, as 
T ^ oo. Therefore, the expected error for novel test data, 
drawn from the same distribution as the training data, will be 
close to the average empirical training error. 

The stability analysis of the MLD algorithm will be per- 
formed by considering two different dictionaries ^ and A with 
L levels each. Each level consists of Ki dictionary atoms and 
the sub-dictionaries in each level are indicated by and 
respectively. Note that the sub-dictionaries and A^ are the 
cluster centers learned using K-lines clustering on the training 
data for level /. The steps involved in proving the overall 
stability of the algorithm are: (a) showing that each level of 
the algorithm is stable in terms of Li{P) distance between the 
distortion functions, defin ed in ([9] ), as the number of training 
samples T ^ oo (Section [IV-A ), (b) proving that stability in 
terms of Li{P) distances indicates closeness of the centers 
of the two clusterings (Section IV-B| ), in terms of the metric 
defined in ( pTj ), and (c) showing that level- wise stability leads 
to overall stability of the dictionary learning algorithm (Section 



A. Level-wise Stability 

Let us define a probability space (3^^,5]^,P^) where yi is 
the data that lies in R^, and Pi is the probability measure. 
The training samples for the sub-dictionaries and Ai are 
two different sets of T i.i.d. realizations from the probability 
space. We also assume that the £2 norm of the training samples 
is bounded from above and below (i.e.), 0<r<||y||2<i^< 
00. Note that, in a general case, the data will lie in for 
the first level of dictionary learning and in a finite union of 
lower-dimensional subspaces of R^ for the subsequent levels. 
In both cases, the following argument on stability will hold. 
This is because when the training data lies in a union of lower 
dimensional subspaces of R^, we can assume it to be still 
lying in R^, but assign the probabilities outside the union of 
subspaces to be zero. 

In each level, and A^ are learned using the K-lines 
clustering algorithm on two different i.i.d. sets of training data. 
The distortion function class for the clusterings, defined similar 
to ([5]), is uniform Donsker because the covering number 
with respect to the supremum norm grows polynomially, 
according to ([8]). When a unique minimizer exists for the 
clustering objective, the distortion functions corresponding to 
the different clusterings and A^ become arbitrarily close, 
Ib^z — 9Ai \\li{Pi) — ^ 0. ^ven for completely disjoint training 
sets, as T ^ 00. However, in the case of multiple minimizers, 
Ib^z — gAi \ \li{Pi) — ^ holds only with respect to a change 
of o{Vt) training samples between the two clusterings, and 
fails to hold when there is a change of Q{Vt) samples [38J , 
|43l. 



B. Distance between Cluster Centers for a Stable Clustering 

For each cluster center in the clustering we pick the 
closest cluster center from A^, in terms of the distortion 
measure and form pairs. Let us indicate the j^^ pair of 
cluster centers by 1/?^ ^ and Xij. Let us define r disjoint sets 
{Ai}J^-^, in which the training data for the clusterings exist, 
such that Pi{[Jj^-^Ai) = 1. By defining such disjoint sets, we 
can formalize the notion of training data lying in a union of 
subspaces of R^. The intuitive fact that the cluster centers 
of two clusterings are close to each other in R^ space, given 
that their distortion functions are close, is proved in the lemma 
below. 

Lemma 4.1: Consider two sub-dictionaries (clusterings) 
and A^ with Ki atoms each obtained using the T training 
samples that exist in the r disjoint sets in the R^ 

space, with < r < ||y||2 < R < 00, and dPi{y)/dy > 
C in each of the sets. When the distortion functions become 
arbitrarily close to each other, \\gq,^ — ^aJIli(Pz) — > as 
T ^ 00, the smallest angle between the subspaces spanned 
by the cluster centers becomes arbitrarily close to zero, i.e., 

Z(V^,,^.,A,,,) Ao,,VjGl,...,i^,. (20) 

Proof: Denote the smallest angle between the subspaces 
represented by t/^ij and A^^ as Xij) = pij and 

define a region S{'ipij,pij/2) = {y|Z('0^ •,y) < pij /2,0 < 
r < \\yh < < 00}. If y G ^(^,,^.,pz,,/2), then 
y^{I-tl^ijtl^fj)y < y^{I-XijXfj)y. An illustration of this 
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cos'^OxiJ A holds for all y G 5, 



then 



Fig. 5. Illustration for showing the stability of cluster centroids from the 
stability of distortion function. 



setup for a 2-D case is given in Figure |5] In this figure, the 
arc qiq2 is of radius r and represents the minimum value of 
||y II2. By definition, the Li (Pi) distance between the distortion 
functions of the clusterings for data that exists in the disjoint 
sets {Ai}J^^ is 



{y)\dPi{y). (21) 



For any j and i with a non-empty Bi^ij = S{tl^ij, pij /2)nAi 
we have, 

ll^^z -9Ai\\l^{Pi) > / \9^i{y) - gAi{y)\dPi{y), (22) 

r ^ 
= / [y^ (i - Xi,,\TA y - - ^i,k^T,k) y 

I(y closest to '0,,^) ] (iPi(y), 



> f [y^ (i - y - y^ (i - y]^^^(y) 



dy. 



(24) 
(25) 



We have gAi{y) = y^ (l - ><i,j>Hj) Y in (23), since Xij is 



the closest cluster center to the data in S{tl)ij^ pij /2) f] Ai in 
terms of the distortion measure (|4l). Note that I is the indicator 



function and (25) follows from ( [24| ) because dPi(y) /dy > C. 
— — p 

Since by assumption, ||^^^ — 5'aJ|li(Pz) — > 0. from (25), we 

have 

(y-v^,,)'-(y"A,,)^ 



0, 



(26) 



because the integrand in ( [25] ) is a continuous non-negative 
function in the region of integration. 

Denoting the smallest angles between y and the subspaces 
spanned by 1/?^ ^ and Xi^j to be 0^^ . and Ox^ . respectively. 



from (26), we have ||y ||2(cos^ 



'1,3 



COS^ Oy 



0, for 



all y. By definition of the region we have . < 

Oxi-. Since ||y||2 is bounded away from zero and infinity, if 



(cos^ 0^ 

we have Xij) 0. This is true for all cluster center 

pairs as we have shown this for an arbitrary i and j. ■ 

C. Stability of the MLD Algorithm 

The stability of the MLD algorithm as a whole, is proved in 



Theorem 4.3 from its level- wise stability by using an induction 
argument. The proof will depend on the following lemma 
which shows that the residuals from two stable clusterings 
belong to the same probability space. 

Lemma 4.2: When the training vectors for the sub- 
dictionaries (clusterings) and A/ are obtained from the 
probability space (3^^, X)^, P^), and the cluster center pairs 
become arbitrarily close to each other as T ^ 00, the resid- 
ual vectors from both the clusterings belong to an identical 
probability space E^+i, P^+i). 

Proof: For the j*^ cluster center pair 1/^^ ^ , A/,j, define 
"^i^j and Ki^j as the projection matrices for their respective 
orthogonal complement subspaces and A[^^ . Define the 
sets D^_^^^ = {y e ^i^j{(3 + d(3) + ^^^^-a} and P>Az,, = 
{y G ^ij{l3 + d/S) + Xija}, where —00 < a < 00, /3 
is an arbitrary fixed vector, not orthogonal to both 1/?^ ^ and 
Xij, and d/3 is a differential element. The residual vector set 
for the cluster ^l^ij, when y G D^^ is given by, r^^ . G 



{^^,^y|y G D^^ J, or equivalently r^^ ^. G + d^)}. 

Similarly for the cluster Xij, we have r^^ ^ G {Aij{/3-\-d/3)}. 
For a 2-D case. Figure [6] shows the 1-D subspace 1/?^ ^ , its 
orthogonal complement V^^^^ , the set D^p^ and the residual 
set {^ij{f3^df3)}. 

In terms of probabilities, we also have that Pi (y G D^^ . ) = 
Pi-^i{r^^ . G {'^ij{l3 + <i/3)}), because the residual set 
{4/ij{/3 + <i/3)} is obtained by a linear transformation of 
D^^ . Here Pi and P^+i are probability measures defined 

(23) 

on the training data for levels / and / + 1 respectively. 
Similarly, P,(y G Dx,J = P/+i(rA,,, e {A,,,(/3 + d/3)}). 
When T ^ 00, the cluster center pairs become arbitrarily 
close to each other, i.e., Z{tl;ij^ Xij) — > 0, by assumption. 
Therefore, the symmetric difference between the sets D^^ . 
and Dxi J approaches the null set, which implies that P/(y G 
D^^ J -Pi{y G P>A,,,) ^ 0. This implies. 



Pi^,{r^^^^e{^i,j{(3^d(3)})- 

P,+i(rA,, G{A,,,(/3 + c^/3)})^0, 



(27) 



for an arbitrary /3 and d/S, as T 00. This means that the 
residuals of 1/?^ ^ and Xij belong to a unique but identical 
probability space. Since we proved this for an arbitrary / and j, 
we can say that the residuals of clusterings ^i and A^ belong 
to an identical probability space given by (3^^+i, P^+i). 

■ 

Theorem 4.3: Given that the training vectors for the first 
level are generated from the probability space (J^i, Xli, Pi), 
and the norms of training vectors for each level are bounded 
as < r < ||y||2 < R < 00, the MLD learning algorithm is 
stable as a whole. 

Proof: The level-wise stability of MLD was shown in 



Section IV- A for two cases: (a) when a unique minimizer 
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D 



where the training samples for level / given by {yi,i}f=i are 




Fig. 6. The residual set {"^ij (/3 + d/3)}, for the 1-D subspace 'ipij, lying 
in its orthogonal complement subspace 



exists for the distortion function and (b) when a unique 
minimizer does not exist. Lemma [4~T] proved that the stability 
in terms of closeness of distortion functions implied stability 
in terms of learned cluster centers. For showing the level- wise 
stability, we assumed that the training vectors in level / for 
clusterings and Ai belonged to the same probability space. 
However, when learning the dictionary, this is true only for the 
first level, as we supply the algorithm with training vectors 
from the probability space Ei, Pi). 

We note that the training vectors for level / + 1 are residuals 
of the clusterings and A^. Lemma 4.2 showed that the 
residuals of level / for both the clusterings belong to an 
identical probability space (y^+i, P^+i), given that the 



training vectors of level / are realizations from the probability 
space (yi^Tii^Pi) and T oo. By induction, this along 
with the fact that the training vectors for level 1 belong to 
the same probability space (J^i, Ei, Pi), shows that all the 
training vectors of both the dictionaries for any level / indeed 
belong to a probability space (3^^, 5]/, P/) corresponding to that 
level. Hence all the levels of the dictionary learning are stable 
and the MLD learning is stable as a whole. ■ 
If there is a unique minimizer to the clustering objective 
in all levels of MLD learning, then the MLD algorithm is 
stable even for completely disjoint training sets, as T ^ oo. 
However, if there are multiple minimizer s in at least one level, 
the algorithm is stable only with respect to a change of o{VT) 
training samples between the two clusterings. In particular, a 
change in Q{VT) samples makes the algorithm unstable. 



D. Generalization Analysis 

Since our learning algorithm consists of multiple levels, and 
cannot be expressed as an ERM on a whole, the algorithm can 
be said to generalize asymptotically if the sum of empirical 
errors for all levels converge to the sum of expected errors, as 
T ^ oo. This can be expressed as 



^ L T L 
T ^ ^ ~ ^ 



^=1 i=l 



1=1 



(28) 



obtained from the probability space (3^^, 5]^, P^). When ( |28[ ) 
holds and the learning algorithm generalizes, it can be seen 
that the expected error for test data which is drawn from the 
same probability space as that of the training data, is close to 
the average empirical error. Therefore, when the cluster centers 
for each level are obtained by minimizing the empirical error, 
we are guaranteed that the expected test error will also be 
small. 

In order to show that ( [28] ) holds, we use the fact that each 
level of MLD learning is obtained using K- lines clustering. 
Hence, from ([7|, the average empirical distortion in each level 
converges to the expected distortion as T ^ oo. 



1 ^ 



0. 



(29) 



The validity of the condition in ( [28] ) follows directly from the 
triangle inequality. 



L T 



;=i ^=1 
1 ^ 



^=1 i=l 



L 

^=1 



(30) 



If the M-OMP coding scheme is used for test data, and 
the training and test data for level 1 are obtained from the 
probability space (3^i, Ei, Pi), the probability space for both 
training and test data in level / will be (3^/, X)^, P^). This 
is because, both the M-OMP coding scheme and the MLD 
learning associate the data to a dictionary atom using the 
maximum absolute correlation measure and create a residual 
that is orthogonal to the atoms chosen so far. Hence, the 
assumption that training and test data are drawn from the same 
probability space in all levels hold and the expected test error 
will be similar to the average empirical training error. 

E. Simulations 

Both stability and generalization are crucial for building 
effective global dictionaries to model natural image patches. 
Although it is not possible to demonstrate the asymptotic 
behavior experimentally, we study the changes in the behavior 
of the learning algorithm with increase in the number of 
samples used for training. 

In order to illustrate the stability characteristics of MLD 
learning, we setup an experiment where we consider a mul- 
tilevel dictionary of 4 levels, with 8 atoms in each level. We 
extracted patches of size 8x8 from the BSDS training images 
and trained multilevel dictionaries using different number of 



training patches T. As we showed in Section IV asymptotic 
stability is guaranteed when the training set is changed by not 
more than o{VT) samples. The inferred dictionary atoms will 
not vary significantly, if this condition is satisfied. 

We fixed the size of the training set at different values 
T = {1000, 5000, 10000, 50,000, 100000} and learned an 
initial set of dictionaries using the proposed algorithm. The 
second set of dictionaries were obtained by replacing different 
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— e — Number of Training Sampies = 1 000 
- Numi:er of Training Sampies = 5000 
— S — Number of Training Sampies = 1 0000 
— \>-' Number of Training Sampies = 50000 
Number of Training Sampies = 100000 



Difference in Number of Training Samples 

Fig. 7. Demonstration of the stability behavior of the proposed MLD 
learning algorithm. The minimum Frobenius norm between difference of two 
dictionaries with respect to permutation of their columns and signs is shown. 
The second dictionary is obtained by replacing different number of samples 
in the training set, used for training the original dictionary, with new data 
samples. 



number of samples from the original training set. For each case 
of T, the number of replaced samples was varied between 
100 and T. For example, when T = 10000, the number of 
replaced training samples were 100, 1000, 5000, and 10000. 
The amount of change between the initial and the second set 
of dictionaries was quantified using the minimum Frobenius 
norm of their difference with respect to permutations of their 
columns and sign changes. In Figure [Tj we plot this quantity 
for different values of T as a function of the number of 
samples replaced in the training set. For each case of T, the 
difference between the dictionaries increases as we increase 
the replaced number of training samples. Furthermore, for a 
fixed number of replaced samples (say 100), the difference 
reduces with the increase in the number of training samples, 
since it becomes closer to asymptotic behavior. 

Generalization of a dictionary learning algorithm guarantees 
a small approximation error for a test data sample, if the 
training samples are well approximated by the dictionary. In 
order to demonstrate the generalization characteristics of MLD 
learning, we designed dictionaries using different number of 
training image patches, of size 8x8, from the BSDS training 
dataset and evaluated the sparse codes for patches in the BSDS 
test dataset (Section |III-E| ). The dictionaries were learned 
at 16 levels with 16 atoms per level. Figure [8] shows the 
approximation error (MSE) for both the training and test 
datasets obtained using multilevel dictionaries. Furthermore, 
the corresponding MSE for the case of similarly designed K- 
SVD dictionaries are included for comparison. In all cases, 
the sparsity in training and testing was fixed at 5 = 16. As it 
can be observed, with MLD, the difference between the MSE 
for training and test data is small even for a small training 
set. However, the K-SVD dictionaries resulted in much higher 
MSE difference for a small training set, although the MSE 
with training data is similar for both MLD and KSVD. Note 
that, in both cases, the approximation error for the test data 
reduces with the increase in the size of the training set. 
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Fig. 8. Demonstration of the generalization characteristics of the proposed 
algorithm compared to K-SVD. We plot the MSE obtained by representing 
patches from the BSDS test dataset, using dictionaries learned with different 
number of training patches. For comparison, we show the training error 
obtained in each case. 



V. Application: Denoising 

Our goal in denoising is to recover the clean image Y 
from the noisy observed image X. The image X is divided 
into patches of size 8x8 with an overlap of 1 pixel, and 
these patches are vectorized and stacked in the matrix X. A 
noisy observation x (a column in X), can be represented as a 
corrupted version of its corresponding clean patch, x = y + 77, 
where rj is the AWGN vector with standard deviation a. Patch- 
wise recovery was performed using RM-OMP with the global 
MLD dictionary learned with 50, 000 patches as described in 



Section IH-E Patchwise error goal was fixed and image-level 
reconstruction constraints were posed as described in |9 |. All 
results were averaged over 5 iterations. Note that, under low- 
noise conditions dictionaries learned from the noisy test image 
itself perform better than global dictionaries. However, under 
high-noise conditions, global dictionaries perform comparably 
to image- specific dictionaries. This results in a significant 
computational advantage since it is not necessary to train a 
separate dictionary for each noisy image. Furthermore, we 
focus on global dictionary learning in this paper and hence we 
compare the results of global MLD and K-SVD dictionaries 
in Table [n| for high-noise conditions (a > 20). For global K- 
SVD dictionary, the results reported in |[9l were used. It can 
be seen that, in almost all the cases, global MLD performs 
better than global K-SVD dictionaries. The denoised Lena 
and Fingerprint images are shown in Figure |9] for a = 20 
and sigma = 50 respectively, where a clear improvement 
in reconstruction performance is observed. Computationally, 
denoising using MLD is less expensive compared to using K- 
SVD as seen from Table III All the times reported in this 



paper are obtained using MATLAB 2012a on a 2.8 GHz, 8- 
core Intel i7 Linux machine. 

VI. Application: Compressed Recovery 

In compressed recovery, the test image is recovered using 
low-dimensional random projections obtained from its patches. 
The finite size of the training set and the lack of robustness 
in the initialization of K-lines clustering can affect the gener- 
alization of multilevel dictionaries to test observations, under 
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TABLE II 

PSNR (dB) of THE DENOISED STANDARD IMAGES CORRUPTED WITH AWGN OF STANDARD DEVIATION (7. IN EACH CASE, THE AVERAGE OF 5 TRIALS 

IS PROVIDED. Higher performance is shown in bold font. 



l^Ulac \0 ) 


Image 


Barbara 


Boat 


Fingerprint 


House 


Lena 


K-SVD 


MLD 


K-SVD 


MLD 


K-SVD 


MLD 


K-SVD 


MLD 


K-SVD 


MLD 


20 


28.87 


29.15 


30.24 


30.31 


28.21 


28.37 


32.88 


32.93 


32.27 


32.44 


25 


27.57 


27.91 


29.17 


29.26 


26.94 


27.22 


31.82 


31.99 


31.2 


31.37 


50 


24.06 


24.15 


25.91 


25.97 


22.68 


23.36 


27.91 


27.98 


27.77 


27.89 


75 


22.54 


22.57 


24.02 


24.06 


19.73 


20.24 


25.33 


25.42 


25.81 


25.92 


100 


21.73 


21.72 


22.83 


22.92 


18.23 


18.72 


23.86 


24.06 


24.45 


24.51 




(a) Original Image 





(b) Noisy Image (22.11 dB) 





(c) K-SVD (28.89 dB) 





(d) MLD (29.19 dB) 




(a) Original Image 



(b) Noisy Image (14.15 dB) 



(c) K-SVD (22.78 dB) 



(d) MLD (23.38 dB) 



Fig. 9. Original, noisy and denoised Lena and Fingerprint images with their respective PSNRs. Reconstructed images for global K-SVD dictionaries are 
obtained using the K-SVD toolbox (56). 



TABLE III 

Average Time(seconds) taken in MATLAB for denoising images 
of size 512 x 512 under different noise conditions. 



Method 


a = 20 


0- = 25 


cr = 50 


a = 75 


a = 100 


K-SVD |56j 
MLD 


16.47 
8.79 


14.93 
8.52 


11.43 
7.96 


9.58 
7.54 


8.61 
7.11 



such severe degradation. The learning procedure can be made 
robust by using multiple clusterings in each level, where each 
clustering is learned from a random subset of the training 
samples for that level. 

Robust Multilevel Dictionaries: Learning robust multilevel 
dictionaries (RMLD) is closely related to Rvotes jSTj, a super- 
vised ensemble learning method. The Rvotes scheme randomly 
samples the training set to create D sets of T]j samples each, 
where Td ^T. The final prediction is obtained by averaging 
the predictions from the multiple hypotheses learned from the 
training sets. For learning level / in RMLD, we draw D subsets 



original training set Yi of size T, allowing for overlap across 
the subsets. The superscript here denotes the index of the 
subset. For each subset y[^^ of size Td <C T, we learn a sub- 
dictionary ^[^^ with K atoms using K-lines clustering. For 
each training sample in Y/, we compute 1— sparse represen- 
tations using all the D sub-dictionaries, and denote the set of 
coefficient matrices as {aJ'^^}^^^. The approximation for the 
training sample in level /, y^^^, is computed as the average of 
approximations using all D sub-dictionaries, ^ ^[^^aj^\ 
The ensemble approximations for all training samples in the 
level can be used to compute the set of residuals, and this 
process is repeated for a desired number of levels, to obtain a 
robust multilevel dictionary (RMLD). Because of its improved 
robustness, reconstruction of test data with an RMLD can be 
performed using simple level-wise approximation, in contrast 
to the RM-OMP procedure with an MLD. We obtain 1— sparse 
approximations on sub-dictionaries in each level, average the 
approximations, compute the residual, and repeat this process 
for the subsequent levels. 



of randomly chosen training samples, {y[^^}£^^ from the Results: The performance of compressed recovery based on 
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TABLE IV 

PSNR (dB) of THE IMAGES RECOVERED FROM COMPRESSED MEASUREMENTS OBTAINED USING GAUSSIAN RANDOM MEASUREMENT MATRICES. 

Results obtained using the proposed MLD, and RMLD dictionaries, along with K-SVD, are shown for different measurement 
noise conditions and number of measurements. higher psnr for each case is indicated in bold font. 



Measurement 
SNR 

(dB) 


Method 


Image 


Barbara 


Boat 


House 


Lena 


Man 


N=8 


N=16 


N=32 


N=8 


N=16 


N=32 


N=8 


N=16 


N=32 


N=8 


N=16 


N=32 


N=8 


N=16 


N=32 





K-SVD 


19.8 


20.54 


21.51 


21.48 


22.07 


23.42 


22.57 


23.91 


25.48 


23.28 


24.23 


26.16 


22.23 


23.2 


24.9 


MLD 


19.96 


20.63 


21.9 


21.68 


22.38 


23.56 


22.73 


23.98 


25.54 


23.3 


24.51 


26.43 


22.41 


23.59 


25.18 


RMLD 


21.55 


22.02 


22.6 


23.05 


23.76 


24.25 


24.2 


24.97 


26.44 


24.1 


25.38 


26.11 


23.89 


25.15 


25.66 


15 


K-SVD 


20.93 


21.89 


24.34 


23.03 


25.02 


27.39 


24.91 


26.87 


31.01 


25.01 


28.08 


31.42 


24.02 


26.02 


28.54 


MLD 


21.17 


22.41 


24.95 


23.42 


25.26 


27.83 


25.06 


27.15 


31.37 


25.29 


28.29 


31.55 


24.31 


26.19 


28.82 


RMLD 


22.58 


24.16 


26.17 


24.82 


26.69 


29.48 


26.41 


28.79 


31.38 


26.89 


29.11 


31.4 


25.51 


27.62 


30.04 


25 


K-SVD 


21.43 


22.09 


24.99 


23.45 


25.88 


28.7 


25.1 


27.1 


31.6 


25.27 


29.03 


31.83 


24.17 


26.59 


29.36 


MLD 


21.56 


22.62 


25.26 


23.69 


25.27 


28.82 


25.36 


27.31 


31.78 


25.48 


28.64 


32 


24.42 


26.71 


29.59 


RMLD 


22.65 


24.33 


26.72 


25.12 


27.07 


29.68 


26.94 


29.04 


32.38 


27.58 


29.55 


32.36 


25.92 


27.79 


30.38 




(a) K-SVD (26.25 dB) (b) MLD (26.59 dB) (c) RMLD (27.81 dB) 



Fig. 10. Compressed recovery of images from random measurements (N = 16, SNR of measurement process = 15dB) using the different dictionaries. In 
each case the PSNR of the recovered image is also shown. 



TABLE V 

Average Time(seconds) taken in MATLAB for training 
dictionaries, with 50, 000 samples, and recovering images of 
size 512 x 512 using different number of random measurements. 



Method 


Training 


N = 8 


N = 16 


N = 32 


K-SVD j56j 
MLD 
RMLD 


675 
502 
1980 


0.07 
0.05 
0.45 


0.09 
0.11 
1.05 


0.10 
0.19 
2.31 



random measurement systems is compared for global MLD, 
RMLD and K-SVD dictionaries. Sensing and recovery were 
performed on a patch-by -patch basis, on non-overlapping 
patches of size 8x8. MLD and K-SVD dictionaries were 
learned with 50,000 BSDS patches as described in Section 
IH-E For learning the RMLD, wq fix K = 16 and obtain 



D = 20 rounds of K- lines dictionaries in each level (L = 16) 
using random sets of training data. The measurement process 
is described as x = ^^a + ry where ^ is the dictionary, 
^ is the measurement or projection matrix, r] is the AWGN 
vector added to the measurement process, x is the output of 
the measurement process, and a is the sparse coefficient vector 
such that y = ^a. The size of the data vector y is M x 1, that 
of ^ is M X that of the measurement matrix ^ is N x M, 
where N < M, and that of the measured vector x is A/" x 1. The 



entries in the random measurement matrix were independent 
realizations from a standard normal distribution. We recover 
the underlying image from its compressed measurements, us- 
ing the K-SVD, MLD, and RMLD dictionaries. For each case, 
we present average results from 100 trial runs, each time with 
a different measurement matrix. The recovery performance 
was evaluated for several standard images and reported in 



Table IV MLD outperforms K-SVD dictionaries in all cases. 
Furthermore, the proposed RMLD algorithm results in much 
improved recovery, for increased complexity during training 
and testing phases. The average time taken for recovering a 
512x512 image using the three proposed dictionaries are listed 
in Table |V] Figure \T0\ illustrates the recovered images obtained 
using different dictionaries with random measurements. 

VIL Conclusions 

We presented a multilevel learning algorithm to design 
global dictionaries that exploit the redundancy and energy 
hierarchy found in natural image patches. The proposed al- 
gorithm employs K-lines clustering to learn atoms for each 
level. We showed that the algorithm converges for a sufficient 
number of levels and that energy hierarchy is exhibited for a 
sufficient number of atoms per level. We also showed that the 
dictionaries learned using different sets of training data, from 
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the same probability space, are arbitrarily close to each other, 
for a sufficiently large number of data samples. Furthermore, 
we proved the asymptotic generalization characteristics, and 
demonstrated the stability and generalization behavior using 
simulations. Simulation results for denoising and compressed 
sensing clearly demonstrated that the learned MLD provide 
superior performance when compared to K-SVD dictionaries. 
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